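# Global chunk options: hide code, keep output, suppress warnings and messages
knitr::opts_chunk$set(echo = FALSE, include = TRUE, warning = FALSE, message = FALSE)
# Print 3 significant digits. Note: `scientific` is not a standard base R option,
# so it likely has no effect; `scipen` (below) is the option that controls notation.
options(scientific = TRUE, digits = 3)
# options(scipen = 9, digits = 3)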
First, we preprocessed the raw data in Python to obtain a cleaner data frame, since some columns in the raw data are stored in JSON format.
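As a rough illustration of this preprocessing step, the sketch below parses one JSON-encoded cell into a main genre and a genre count using only the Python standard library. The cell layout (a list of objects with a `name` key), the helper name `first_genre_and_count`, the sample cell, and the rule that the first listed genre is the main genre are all assumptions made for illustration; the original preprocessing script is not shown in this report.

```python
import json


def first_genre_and_count(raw):
    """Return (first genre name or None, number of genres) for one raw cell.

    Assumes the cell holds a JSON list of objects that each have a "name" key.
    Malformed or empty cells yield (None, 0) instead of raising.
    """
    try:
        items = json.loads(raw) if raw else []
    except (json.JSONDecodeError, TypeError):
        items = []
    if not isinstance(items, list):
        items = []
    names = [item["name"] for item in items
             if isinstance(item, dict) and "name" in item]
    return (names[0] if names else None, len(names))


# Hypothetical raw cell, invented for illustration only
raw_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(first_genre_and_count(raw_cell))
```

Returning a count alongside the first name mirrors the `genres` and `genrecount` columns that appear in the data frame below, though the actual derivation of those columns may differ.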
Graph of the number of movies by company
## Var1 Freq
## 1 0
## 2 Action 588
## 3 Adventure 288
## 4 Animation 99
## 5 Comedy 634
## 6 Crime 141
## 7 Documentary 30
## 8 Drama 745
## 9 Family 38
## 10 Fantasy 93
## 11 Foreign 1
## 12 History 18
## 13 Horror 197
## 14 Music 20
## 15 Mystery 27
## 16 Romance 70
## 17 Science Fiction 79
## 18 Thriller 118
## 19 TV Movie 0
## 20 War 18
## 21 Western 22
We drop the single movie in the Foreign genre: the genre is rare, and one observation is too few to support any prediction for it.
## 'data.frame': 3225 obs. of 14 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ budget : int 237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
## $ genres : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
## $ popularity: num 150.4 139.1 107.4 112.3 43.9 ...
## $ company : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
## $ date : Date, format: "2009-12-10" "2007-05-19" ...
## $ revenue : num 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ runtime : num 162 169 148 165 132 139 100 141 153 151 ...
## $ title : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
## $ score : num 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote : int 11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
## $ genrecount: int 4 3 3 4 3 3 2 3 3 3 ...
## $ profit : num 2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
## $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## 'data.frame': 3225 obs. of 15 variables:
## $ budget : int 237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
## $ genres : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
## $ popularity: num 150.4 139.1 107.4 112.3 43.9 ...
## $ company : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
## $ date : Date, format: "2009-12-10" "2007-05-19" ...
## $ revenue : num 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ runtime : num 162 169 148 165 132 139 100 141 153 151 ...
## $ title : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
## $ score : num 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote : int 11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
## $ profit : num 2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
## $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ season : Factor w/ 4 levels "Fall","Spring",..: 4 2 1 3 2 2 1 2 3 2 ...
## $ quarter : chr "Q4" "Q2" "Q4" "Q3" ...
## $ year : num 2009 2007 2015 2012 2012 ...
Summary statistics
## revenue budget popularity runtime
## Min. :5.00e+00 Min. :1.00e+00 Min. : 0 Min. : 41
## 1st Qu.:1.71e+07 1st Qu.:1.05e+07 1st Qu.: 10 1st Qu.: 96
## Median :5.52e+07 Median :2.50e+07 Median : 20 Median :107
## Mean :1.21e+08 Mean :4.07e+07 Mean : 29 Mean :111
## 3rd Qu.:1.46e+08 3rd Qu.:5.50e+07 3rd Qu.: 37 3rd Qu.:121
## Max. :2.79e+09 Max. :3.80e+08 Max. :876 Max. :338
## score vote profit
## Min. :2.30 Min. : 1 Min. :-1.66e+08
## 1st Qu.:5.80 1st Qu.: 179 1st Qu.: 2.52e+05
## Median :6.30 Median : 471 Median : 2.64e+07
## Mean :6.31 Mean : 978 Mean : 8.07e+07
## 3rd Qu.:6.90 3rd Qu.: 1148 3rd Qu.: 9.75e+07
## Max. :8.50 Max. :13752 Max. : 2.55e+09
## budget genres popularity
## Min. :1.00e+00 Drama :745 Min. : 0
## 1st Qu.:1.05e+07 Comedy :634 1st Qu.: 10
## Median :2.50e+07 Action :588 Median : 20
## Mean :4.07e+07 Adventure:288 Mean : 29
## 3rd Qu.:5.50e+07 Horror :197 3rd Qu.: 37
## Max. :3.80e+08 Crime :141 Max. :876
## (Other) :632
## company date revenue
## Others :1636 Min. :1916-09-04 Min. :5.00e+00
## Paramount Pictures: 255 1st Qu.:1998-09-10 1st Qu.:1.71e+07
## Sony Pictures : 277 Median :2005-07-20 Median :5.52e+07
## Universal Pictures: 338 Mean :2002-03-18 Mean :1.21e+08
## Walt Disney : 497 3rd Qu.:2010-11-11 3rd Qu.:1.46e+08
## Warner Bros : 222 Max. :2016-09-09 Max. :2.79e+09
##
## runtime title score
## Min. : 41 The Host : 2 Min. :2.30
## 1st Qu.: 96 (500) Days of Summer : 1 1st Qu.:5.80
## Median :107 [REC] : 1 Median :6.30
## Mean :111 [REC]² : 1 Mean :6.31
## 3rd Qu.:121 10 Cloverfield Lane : 1 3rd Qu.:6.90
## Max. :338 10 Things I Hate About You: 1 Max. :8.50
## (Other) :3218
## vote profit profitable season
## Min. : 1 Min. :-1.66e+08 0: 787 Fall :930
## 1st Qu.: 179 1st Qu.: 2.52e+05 1:2438 Spring:704
## Median : 471 Median : 2.64e+07 Summer:837
## Mean : 978 Mean : 8.07e+07 Winter:754
## 3rd Qu.: 1148 3rd Qu.: 9.75e+07
## Max. :13752 Max. : 2.55e+09
##
## quarter year
## Length:3225 Min. :1916
## Class :character 1st Qu.:1998
## Mode :character Median :2005
## Mean :2002
## 3rd Qu.:2010
## Max. :2016
##
Standard deviation (first block) and variance (second block)
## revenue budget popularity runtime score vote
## 1.86e+08 4.44e+07 3.62e+01 2.10e+01 8.60e-01 1.41e+03
## profit
## 1.58e+08
## revenue budget popularity runtime score vote
## 3.47e+16 1.97e+15 1.31e+03 4.40e+02 7.39e-01 2.00e+06
## profit
## 2.50e+16
The means, variances and standard deviations differ by many orders of magnitude across variables because the variables are measured on very different scales. We therefore need to scale the data before fitting models such as linear regression, PCR and KNN.
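A minimal sketch of what scaling means here, assuming the usual z-score standardization (subtract the mean, divide by the sample standard deviation). The report does not show which scaling function was used, so this is an assumption; the function name `standardize` is hypothetical. The input values are the first five `budget` and `runtime` entries from the structure listing above.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - m) / s for v in values]


# First five values of each column, copied from the structure listing
budgets = [237000000, 300000000, 245000000, 250000000, 260000000]
runtimes = [162, 169, 148, 165, 132]

z_budget = standardize(budgets)
z_runtime = standardize(runtimes)

# After scaling, both columns have mean 0 and standard deviation 1,
# so neither dominates a distance or a variance decomposition.
print([round(z, 2) for z in z_budget])
print([round(z, 2) for z in z_runtime])
```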
Test whether mean revenue differs across genres (one-way ANOVA)
## Df Sum Sq Mean Sq F value Pr(>F)
## genres 17 1.4e+19 8.21e+17 26.9 <2e-16 ***
## Residuals 3207 9.8e+19 3.06e+16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Overall, there is evidence that mean revenue is not the same across genres (F = 26.9, p < 2e-16). Revenue appears to depend on genre.
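To make the F value in the ANOVA table concrete, the sketch below computes a one-way ANOVA F statistic from scratch: the between-group mean square divided by the within-group mean square. The function name `one_way_anova_f` and the toy groups are invented for illustration; only the two mean squares at the end are taken from the genres table above.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group size times squared mean offset
    ss_between = sum(len(g) * (gm - grand_mean) ** 2
                     for g, gm in zip(groups, group_means))
    # Within-group sum of squares: squared deviations from each group mean
    ss_within = sum((v - gm) ** 2
                    for g, gm in zip(groups, group_means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Toy data, invented for illustration
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
toy_f = one_way_anova_f(toy_groups)
print(toy_f)

# Same ratio using the rounded mean squares from the genres table
reported_f = 8.21e17 / 3.06e16
print(round(reported_f, 1))
```

The ratio of the rounded mean squares comes out slightly below the printed 26.9 because the table entries are themselves rounded to three significant digits.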
Test whether mean revenue differs across companies (one-way ANOVA with Tukey comparisons)
## Df Sum Sq Mean Sq F value Pr(>F)
## company 5 4.80e+18 9.59e+17 28.8 <2e-16 ***
## Residuals 3219 1.07e+20 3.33e+16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = revenue ~ company, data = movie)
##
## $company
## diff lwr upr p adj
## Paramount Pictures-Others 67454527 3.24e+07 102485737 0.000
## Sony Pictures-Others 49800557 1.60e+07 83606812 0.000
## Universal Pictures-Others 89325542 5.82e+07 120413679 0.000
## Walt Disney-Others 89174826 6.25e+07 115824795 0.000
## Warner Bros-Others 29868762 -7.35e+06 67084432 0.199
## Sony Pictures-Paramount Pictures -17653970 -6.28e+07 27502185 0.875
## Universal Pictures-Paramount Pictures 21871015 -2.13e+07 65029881 0.699
## Walt Disney-Paramount Pictures 21720299 -1.84e+07 61800672 0.635
## Warner Bros-Paramount Pictures -37585764 -8.53e+07 10176371 0.218
## Universal Pictures-Sony Pictures 39524985 -2.65e+06 81695650 0.081
## Walt Disney-Sony Pictures 39374270 3.60e+05 78388543 0.046
## Warner Bros-Sony Pictures -19931794 -6.68e+07 26939292 0.831
## Walt Disney-Universal Pictures -150716 -3.68e+07 36533379 1.000
## Warner Bros-Universal Pictures -59456780 -1.04e+08 -14506718 0.002
## Warner Bros-Walt Disney -59306064 -1.01e+08 -17303008 0.001
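Overall, there is evidence that mean revenue is not the same across companies (F = 28.8, p < 2e-16). Revenue appears to depend on the production company. In the Tukey comparisons, Paramount Pictures, Sony Pictures, Universal Pictures and Walt Disney each differ significantly from Others, while Warner Bros does not (p adj = 0.199).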
## Df Sum Sq Mean Sq F value Pr(>F)
## season 3 1.59e+18 5.31e+17 15.5 5.3e-10 ***
## Residuals 3221 1.10e+20 3.43e+16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = revenue ~ season, data = movie)
##
## $season
## diff lwr upr p adj
## Spring-Fall 46272179 22500396 70043962 0.000
## Summer-Fall 49081296 26409934 71752657 0.000
## Winter-Fall 8412966 -14905901 31731832 0.790
## Summer-Spring 2809117 -21525012 27143245 0.991
## Winter-Spring -37859214 -62797712 -12920716 0.001
## Winter-Summer -40668330 -64560204 -16776456 0.000
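Winter and Fall do not differ significantly (p adj = 0.790), and neither do Summer and Spring (p adj = 0.991), while every comparison between these two pairs is significant. The seasons therefore seem to form two groups: Fall and Winter, and Spring and Summer.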
Construct the model on the training set (using all numerical variables)
##
## Call:
## lm(formula = revenue ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.21e+08 -3.89e+07 -1.92e+06 2.46e+07 1.60e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122161921 2168480 56.34 < 2e-16 ***
## budget 82502895 2822244 29.23 < 2e-16 ***
## popularity 14588304 2952684 4.94 8.4e-07 ***
## runtime -1265467 2415512 -0.52 0.60
## score 212212 2648862 0.08 0.94
## vote 85807055 3723449 23.05 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01e+08 on 2170 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.711
## F-statistic: 1.07e+03 on 5 and 2170 DF, p-value: <2e-16
## budget popularity runtime score vote
## 1.68 2.15 1.26 1.50 2.95
Test
## mae mse rmse mape
## 6.25e+07 1.12e+16 1.06e+08 1.06e+04
## mae mse rmse mape
## 5.86e+07 1.02e+16 1.01e+08 4.76e+03
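The four error measures reported above can be sketched as follows. The definitions are the standard ones; whether the R function used here reports MAPE as a fraction or as a percentage is not shown, so the fraction form below is an assumption. The function name `regression_errors` and the toy numbers are invented for illustration.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE (as a fraction) for paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual value, so actual values near zero inflate it
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape}


# Toy data: the third actual value is tiny compared with its prediction
actual = [100.0, 200.0, 5.0]
predicted = [110.0, 190.0, 55.0]
metrics = regression_errors(actual, predicted)
print(metrics)
```

The toy third observation shows why MAPE can be very large even when MAE and RMSE look reasonable: a single small actual value contributes a relative error of 10 on its own. The summary statistics list a minimum revenue of 5, so a similar effect may be behind the large MAPE values here.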
Feature selection
All three feature selection methods show that predictors (budget + popularity + vote) form the best model.
Construct the model on the training set
##
## Call:
## lm(formula = revenue ~ budget + popularity + vote, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.21e+08 -3.85e+07 -2.19e+06 2.44e+07 1.60e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.22e+08 2.17e+06 56.36 < 2e-16 ***
## budget 8.23e+07 2.62e+06 31.45 < 2e-16 ***
## popularity 1.46e+07 2.95e+06 4.96 7.8e-07 ***
## vote 8.57e+07 3.47e+06 24.72 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01e+08 on 2172 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.711
## F-statistic: 1.78e+03 on 3 and 2172 DF, p-value: <2e-16
## budget popularity vote
## 1.45 2.15 2.55
Predict on the test set
## mae mse rmse mape
## 6.25e+07 1.12e+16 1.06e+08 1.06e+04
## mae mse rmse mape
## 5.86e+07 1.02e+16 1.01e+08 4.76e+03
It seems that each season has the same effect on the model.
##
## Call:
## lm(formula = revenue ~ ., data = train1_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.08e+08 -4.03e+07 -1.43e+06 2.90e+07 1.62e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98181070 6527080 15.04 < 2e-16 ***
## budget 77348867 3033574 25.50 < 2e-16 ***
## popularity 13617667 2932282 4.64 3.6e-06 ***
## runtime 5900196 2594312 2.27 0.02305 *
## score -1602884 2751829 -0.58 0.56030
## vote 87028064 3716646 23.42 < 2e-16 ***
## genresAdventure 14998048 8669568 1.73 0.08378 .
## genresAnimation 87790816 13236699 6.63 4.2e-11 ***
## genresComedy 25268186 7163941 3.53 0.00043 ***
## genresCrime -10102172 11467478 -0.88 0.37845
## genresDocumentary 44711638 23188936 1.93 0.05397 .
## genresDrama 4923647 7294556 0.67 0.49976
## genresFamily 82467003 20391297 4.04 5.4e-05 ***
## genresFantasy 2870822 13660650 0.21 0.83357
## genresHistory 15689738 24998838 0.63 0.53032
## genresHorror 17619142 10531825 1.67 0.09448 .
## genresMusic 23079584 29354497 0.79 0.43182
## genresMystery 6103862 24024211 0.25 0.79946
## genresRomance 17057626 14926772 1.14 0.25327
## genresScience Fiction -15440346 15106799 -1.02 0.30686
## genresThriller -7679779 12337527 -0.62 0.53370
## genresWar -56563625 33719837 -1.68 0.09360 .
## genresWestern -3456414 24929422 -0.14 0.88974
## companyParamount Pictures 19120256 8303147 2.30 0.02139 *
## companySony Pictures 4555821 7970621 0.57 0.56767
## companyUniversal Pictures 16743343 7455117 2.25 0.02481 *
## companyWalt Disney 16706288 6373478 2.62 0.00882 **
## companyWarner Bros -535044 8559594 -0.06 0.95016
## seasonSpring 9929435 6117507 1.62 0.10471
## seasonSummer 9106694 5938990 1.53 0.12533
## seasonWinter 4037639 5967395 0.68 0.49872
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99300000 on 2145 degrees of freedom
## Multiple R-squared: 0.725, Adjusted R-squared: 0.721
## F-statistic: 188 on 30 and 2145 DF, p-value: <2e-16
## budget popularity
## 2.01 2.20
## runtime score
## 1.51 1.68
## vote genresAdventure
## 3.04 1.40
## genresAnimation genresComedy
## 1.25 1.79
## genresCrime genresDocumentary
## 1.26 1.08
## genresDrama genresFamily
## 2.06 1.08
## genresFantasy genresHistory
## 1.14 1.07
## genresHorror genresMusic
## 1.32 1.04
## genresMystery genresRomance
## 1.04 1.13
## genresScience Fiction genresThriller
## 1.11 1.18
## genresWar genresWestern
## 1.03 1.06
## companyParamount Pictures companySony Pictures
## 1.09 1.11
## companyUniversal Pictures companyWalt Disney
## 1.12 1.18
## companyWarner Bros seasonSpring
## 1.08 1.44
## seasonSummer seasonWinter
## 1.48 1.44
The model shows no significant season effect: none of the season coefficients differs significantly from the baseline (all p > 0.10). Season does not seem to be a necessary predictor.
Feature Selection
When season, genre and company are included in the model, the best numerical predictors are still budget, popularity and vote. We build the model with these 3 predictors and the 2 categorical variables genre and company.
##
## Call:
## lm(formula = revenue ~ budget + vote + company + genres + popularity,
## data = train1_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.14e+08 -4.04e+07 -8.25e+05 2.96e+07 1.62e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103783104 5493099 18.89 < 2e-16 ***
## budget 79307464 2810718 28.22 < 2e-16 ***
## vote 87343750 3432074 25.45 < 2e-16 ***
## companyParamount Pictures 20187352 8296723 2.43 0.01505 *
## companySony Pictures 4706890 7968099 0.59 0.55477
## companyUniversal Pictures 17875529 7436574 2.40 0.01631 *
## companyWalt Disney 16364447 6369330 2.57 0.01026 *
## companyWarner Bros -135185 8558273 -0.02 0.98740
## genresAdventure 14924242 8645045 1.73 0.08443 .
## genresAnimation 80203108 12811282 6.26 4.6e-10 ***
## genresComedy 24197175 7152449 3.38 0.00073 ***
## genresCrime -8821228 11302914 -0.78 0.43522
## genresDocumentary 40500854 22990580 1.76 0.07827 .
## genresDrama 6560315 6935298 0.95 0.34429
## genresFamily 78112299 20296238 3.85 0.00012 ***
## genresFantasy 1516100 13618042 0.11 0.91136
## genresHistory 22335236 24701813 0.90 0.36599
## genresHorror 15897705 10491528 1.52 0.12985
## genresMusic 22199706 29264598 0.76 0.44818
## genresMystery 3544411 24017226 0.15 0.88269
## genresRomance 16669177 14880624 1.12 0.26276
## genresScience Fiction -15795028 15101070 -1.05 0.29570
## genresThriller -7928898 12337416 -0.64 0.52051
## genresWar -55041648 33646312 -1.64 0.10201
## genresWestern 726641 24731085 0.03 0.97656
## popularity 13447493 2930278 4.59 4.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99400000 on 2150 degrees of freedom
## Multiple R-squared: 0.724, Adjusted R-squared: 0.721
## F-statistic: 225 on 25 and 2150 DF, p-value: <2e-16
## budget vote
## 1.73 2.59
## companyParamount Pictures companySony Pictures
## 1.09 1.11
## companyUniversal Pictures companyWalt Disney
## 1.11 1.17
## companyWarner Bros genresAdventure
## 1.08 1.39
## genresAnimation genresComedy
## 1.17 1.78
## genresCrime genresDocumentary
## 1.22 1.06
## genresDrama genresFamily
## 1.86 1.07
## genresFantasy genresHistory
## 1.13 1.04
## genresHorror genresMusic
## 1.30 1.03
## genresMystery genresRomance
## 1.04 1.12
## genresScience Fiction genresThriller
## 1.11 1.17
## genresWar genresWestern
## 1.03 1.04
## popularity
## 2.19
The adjusted R-squared increases by about one percentage point (from 0.711 to 0.721) compared with the best model using only numerical variables.
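For reference, adjusted R-squared penalizes the plain R-squared for the number of predictors. The sketch below applies the standard formula to the rounded values printed in the model summaries above; the function name `adjusted_r2` is hypothetical, and the sample size of 2176 is inferred from the residual degrees of freedom (2150 + 25 + 1).

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_train = 2176  # inferred from the residual degrees of freedom

# (label, R-squared, number of estimated slopes) from the summaries above
models = [
    ("budget + popularity + vote", 0.711, 3),
    ("plus company and genres", 0.724, 25),
    ("all predictors incl. season", 0.725, 30),
]
for label, r2, p in models:
    print(label, round(adjusted_r2(r2, n_train, p), 3))
```

Because the inputs are already rounded to three decimals, the results agree with the printed adjusted values only up to rounding.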
Prediction
## mae mse rmse mape
## 6.19e+07 1.09e+16 1.05e+08 7.33e+03
## mae mse rmse mape
## 5.83e+07 9.76e+15 9.88e+07 8.55e+03
## [1] 86399
## [1] 86439
## [1] "------"
## [1] 86395
## [1] 86424
## [1] "-------"
## [1] 86343
## [1] 86497
Check variance
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.89e+08 3.10e+07 926 23.9 20.1 0.699
## Proportion of Variance 9.74e-01 2.62e-02 0 0.0 0.0 0.000
## Cumulative Proportion 9.74e-01 1.00e+00 1 1.0 1.0 1.000
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.768 1.108 0.894 0.6466 0.5045 0.4186
## Proportion of Variance 0.521 0.204 0.133 0.0697 0.0424 0.0292
## Cumulative Proportion 0.521 0.725 0.859 0.9284 0.9708 1.0000
Train
## Data: X dimension: 2176 5
## Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 1.88e+08 122329562 106688281 106237668 103730433 102073627
## adjCV 1.88e+08 122230125 106520321 106183882 103659547 102017104
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 48.50 71.69 87.41 95.61 100.00
## revenue 58.21 68.09 68.67 70.49 71.13
## Data: X dimension: 2176 5
## Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 1.88e+08 122602547 109625493 107879872 103007610 103518646
## adjCV 1.88e+08 122517816 109502905 107831225 102938948 103373422
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 49.00 71.71 87.31 95.63 100.00
## revenue 57.87 66.61 67.94 70.47 71.13
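With the unscaled data, the first 2 components capture nearly 100% of the variance (PC1 alone explains 97.4%), because the variables with the largest scales dominate. With the scaled data, the first 2 components explain about 72% of the variance, and 4 components are needed to exceed 90%.
With the same number of components, the scaled data explain slightly more of the variance in revenue than the non-scaled data (for example 68.09% versus 66.61% with 2 components).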
Let’s try the pcr model on test data
The share of revenue variance explained rises markedly from 1 component to 2 (from about 58% to about 68%); after that the gains are small. We can say that 2 principal components capture most of the variance that the components can explain.
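The "Proportion of Variance" and "Cumulative Proportion" rows in the component tables follow directly from the standard deviations: each component's variance is its squared standard deviation, divided by the total. The sketch below reproduces this for the scaled decomposition, using the standard deviations printed in the second "Importance of components" table; the function name `variance_proportions` is hypothetical.

```python
from itertools import accumulate


def variance_proportions(sdev):
    """Return (proportions, cumulative proportions) from component SDs."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = list(accumulate(proportions))
    return proportions, cumulative


# Standard deviations of PC1..PC6 from the scaled decomposition above
sdev_scaled = [1.768, 1.108, 0.894, 0.6466, 0.5045, 0.4186]
proportions, cumulative = variance_proportions(sdev_scaled)

print([round(p, 3) for p in proportions])
print([round(c, 3) for c in cumulative])
```

Small differences in the last digit relative to the printed table are expected, since the standard deviations above are themselves rounded.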
Try linear model with predictors as PCs:
##
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11e+09 -4.03e+07 -4.95e+06 2.42e+07 1.73e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120012282 2277573 52.7 <2e-16 ***
## PC1 -92108152 1462914 -63.0 <2e-16 ***
## PC2 54901946 2115458 25.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.06e+08 on 2173 degrees of freedom
## Multiple R-squared: 0.681, Adjusted R-squared: 0.681
## F-statistic: 2.32e+03 on 2 and 2173 DF, p-value: <2e-16
## PC1 PC2
## 1 1
##
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr.nc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20e+09 -4.08e+07 -5.95e+06 2.28e+07 1.76e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121493614 2330120 52.1 <2e-16 ***
## PC1 -89813816 1463518 -61.4 <2e-16 ***
## PC2 51258272 2149532 23.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09e+08 on 2173 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.666
## F-statistic: 2.17e+03 on 2 and 2173 DF, p-value: <2e-16
## PC1 PC2
## 1 1
## [1] 86611
## [1] 86633
## [1] 86710
## [1] 86732
##
## Pearson's Chi-squared test
##
## data: contable1
## X-squared = 54, df = 17, p-value = 1e-05
##
## Pearson's Chi-squared test
##
## data: contable2
## X-squared = 93, df = 5, p-value <2e-16
##
## Pearson's Chi-squared test
##
## data: contable3
## X-squared = 20, df = 3, p-value = 1e-04
In this part we construct the logit model on the whole dataset.
## 'data.frame': 3225 obs. of 10 variables:
## $ revenue : num 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ budget : int 237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
## $ popularity: num 150.4 139.1 107.4 112.3 43.9 ...
## $ runtime : num 162 169 148 165 132 139 100 141 153 151 ...
## $ score : num 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote : int 11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
## $ genres : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
## $ company : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
## $ season : Factor w/ 4 levels "Fall","Spring",..: 4 2 1 3 2 2 1 2 3 2 ...
## $ y : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
Budget and revenue alone are enough to determine whether a movie is profitable, because profit is calculated as revenue minus budget. Our preliminary test with “bestglm” shows the same result.
## revenue budget popularity runtime score vote genres company season
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## 5 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## Criterion
## 1 17.4
## 2 17.6
## 3 18.1
## 4 18.3
## 5 18.5
However, because revenue and budget together determine profitable almost by definition, we had better not use them together as predictors.
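The sketch below shows why the pair (revenue, budget) is too direct a predictor: the label can be computed from them with no model at all. The rule "profitable is 1 when profit is strictly positive" is an assumption, since the report does not state the exact threshold, and the function name `profit_label` is hypothetical. The first example uses the rounded revenue and the budget of the first row in the structure listing.

```python
def profit_label(revenue, budget):
    """Return (profit, profitable) where profitable is 1 if profit > 0.

    The strict "> 0" threshold is an assumption for illustration.
    """
    profit = revenue - budget
    return profit, (1 if profit > 0 else 0)


# First row of the data: revenue is rounded as printed, budget is exact
first_profit, first_label = profit_label(2.79e9, 237000000)
print(first_profit, first_label)

# Invented example of a movie that loses money
loss_profit, loss_label = profit_label(5.0, 10.0)
print(loss_profit, loss_label)
```

A classifier given both inputs only has to rediscover this subtraction, which says little about what can be predicted before release.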
In practice, we prefer budget to revenue for predicting profit. A film manager would want a prediction of the profit of a movie before its main release date. The information available at that point is the budget, runtime, genres, production company, popularity, vote and score (vote and score can be obtained from a preview screening of the movie, and popularity can be gauged after advertising, trailers and some leaks).
Let’s try the model with budget and other predictors without revenue
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = movie_nd[-c(1)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.467 0.000 0.293 0.728 1.866
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.31e+00 4.46e-01 -5.19 2.1e-07 ***
## budget -1.69e-08 2.03e-09 -8.34 < 2e-16 ***
## popularity 2.21e-02 9.82e-03 2.25 0.02448 *
## runtime 7.95e-04 2.75e-03 0.29 0.77253
## score 2.51e-01 7.05e-02 3.57 0.00036 ***
## vote 2.43e-03 3.38e-04 7.19 6.7e-13 ***
## genresAdventure 3.44e-02 2.10e-01 0.16 0.86979
## genresAnimation 4.27e-01 3.49e-01 1.22 0.22116
## genresComedy 4.48e-01 1.56e-01 2.88 0.00398 **
## genresCrime 4.70e-02 2.50e-01 0.19 0.85105
## genresDocumentary 6.89e-01 4.42e-01 1.56 0.11943
## genresDrama 4.30e-02 1.55e-01 0.28 0.78137
## genresFamily 4.42e-01 4.59e-01 0.96 0.33546
## genresFantasy 3.43e-01 3.58e-01 0.96 0.33770
## genresHistory 3.96e-01 6.37e-01 0.62 0.53387
## genresHorror 1.01e+00 2.66e-01 3.80 0.00015 ***
## genresMusic 4.91e-01 5.58e-01 0.88 0.37926
## genresMystery -1.80e-02 5.23e-01 -0.03 0.97260
## genresRomance 6.07e-01 3.53e-01 1.72 0.08515 .
## genresScience Fiction -7.18e-02 3.82e-01 -0.19 0.85103
## genresThriller 1.78e-02 2.76e-01 0.06 0.94866
## genresWar -1.27e+00 6.10e-01 -2.08 0.03707 *
## genresWestern 2.04e+00 8.44e-01 2.42 0.01548 *
## companyParamount Pictures 9.52e-01 1.97e-01 4.83 1.4e-06 ***
## companySony Pictures 7.01e-01 1.82e-01 3.86 0.00011 ***
## companyUniversal Pictures 8.18e-01 1.91e-01 4.29 1.8e-05 ***
## companyWalt Disney 8.96e-01 1.52e-01 5.90 3.6e-09 ***
## companyWarner Bros 6.51e-01 2.01e-01 3.23 0.00123 **
## seasonSpring 2.71e-01 1.35e-01 2.01 0.04415 *
## seasonSummer 4.75e-01 1.32e-01 3.58 0.00034 ***
## seasonWinter 2.85e-01 1.29e-01 2.21 0.02743 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3584.1 on 3224 degrees of freedom
## Residual deviance: 2656.1 on 3194 degrees of freedom
## AIC: 2718
##
## Number of Fisher Scoring iterations: 8
## (Intercept) budget
## 0.0989 1.0000
## popularity runtime
## 1.0223 1.0008
## score vote
## 1.2858 1.0024
## genresAdventure genresAnimation
## 1.0350 1.5325
## genresComedy genresCrime
## 1.5653 1.0481
## genresDocumentary genresDrama
## 1.9909 1.0439
## genresFamily genresFantasy
## 1.5562 1.4095
## genresHistory genresHorror
## 1.4861 2.7467
## genresMusic genresMystery
## 1.6340 0.9822
## genresRomance genresScience Fiction
## 1.8350 0.9307
## genresThriller genresWar
## 1.0179 0.2806
## genresWestern companyParamount Pictures
## 7.7111 2.5912
## companySony Pictures companyUniversal Pictures
## 2.0167 2.2661
## companyWalt Disney companyWarner Bros
## 2.4501 1.9179
## seasonSpring seasonSummer
## 1.3110 1.6080
## seasonWinter
## 1.3304
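The block of values above appears to be the exponentiated coefficients of the logit model, that is, odds ratios: each value matches exp(estimate) from the coefficient table up to rounding. The sketch below reproduces a few of them from the rounded estimates. The per-10-million rescaling of the budget coefficient is an added illustration and is not part of the original output.

```python
import math

# Rounded estimates copied from the coefficient table above
coefficients = {
    "genresHorror": 1.01,
    "genresWar": -1.27,
    "companyWalt Disney": 0.896,
    "seasonSummer": 0.475,
}

# Odds ratio = exp(coefficient): multiplicative change in the odds of
# being profitable relative to the baseline level of the factor
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))

# The budget coefficient is per single currency unit, so its odds ratio
# prints as 1.0000. Rescaling to 10 million units makes it readable.
budget_beta = -1.69e-08
budget_or_per_10m = math.exp(budget_beta * 1e7)
print(round(budget_or_per_10m, 3))
```

Read this way, the budget odds ratio of 1.0000 does not mean budget has no effect: under this model, every additional 10 million of budget multiplies the odds of being profitable by roughly 0.84, holding the other predictors fixed.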
## budget popularity runtime score vote genres company season Criterion
## 1 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE 2714
## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 2716
## 3 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE 2717
## 4 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE 2719
## 5 TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE 2722
Build model without runtime
##
## Call:
## glm(formula = y ~ budget + popularity + score + vote + genres +
## company + season, family = "binomial", data = movie_nd[-c(1)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.462 0.000 0.294 0.729 1.880
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.27e+00 4.21e-01 -5.39 6.9e-08 ***
## budget -1.68e-08 1.96e-09 -8.55 < 2e-16 ***
## popularity 2.22e-02 9.80e-03 2.26 0.02376 *
## score 2.58e-01 6.66e-02 3.88 0.00011 ***
## vote 2.42e-03 3.37e-04 7.19 6.3e-13 ***
## genresAdventure 3.26e-02 2.10e-01 0.16 0.87638
## genresAnimation 4.08e-01 3.43e-01 1.19 0.23407
## genresComedy 4.45e-01 1.55e-01 2.87 0.00415 **
## genresCrime 5.04e-02 2.50e-01 0.20 0.83998
## genresDocumentary 6.78e-01 4.41e-01 1.54 0.12380
## genresDrama 4.93e-02 1.53e-01 0.32 0.74762
## genresFamily 4.30e-01 4.57e-01 0.94 0.34700
## genresFantasy 3.40e-01 3.58e-01 0.95 0.34133
## genresHistory 4.20e-01 6.32e-01 0.66 0.50651
## genresHorror 1.01e+00 2.66e-01 3.79 0.00015 ***
## genresMusic 4.88e-01 5.58e-01 0.87 0.38256
## genresMystery -2.19e-02 5.23e-01 -0.04 0.96663
## genresRomance 6.05e-01 3.53e-01 1.72 0.08601 .
## genresScience Fiction -7.18e-02 3.82e-01 -0.19 0.85105
## genresThriller 1.97e-02 2.76e-01 0.07 0.94315
## genresWar -1.26e+00 6.09e-01 -2.07 0.03823 *
## genresWestern 2.05e+00 8.44e-01 2.43 0.01524 *
## companyParamount Pictures 9.51e-01 1.97e-01 4.82 1.4e-06 ***
## companySony Pictures 7.01e-01 1.82e-01 3.86 0.00012 ***
## companyUniversal Pictures 8.19e-01 1.91e-01 4.29 1.8e-05 ***
## companyWalt Disney 8.94e-01 1.52e-01 5.90 3.7e-09 ***
## companyWarner Bros 6.52e-01 2.01e-01 3.24 0.00121 **
## seasonSpring 2.71e-01 1.35e-01 2.01 0.04422 *
## seasonSummer 4.75e-01 1.32e-01 3.58 0.00034 ***
## seasonWinter 2.86e-01 1.29e-01 2.21 0.02711 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3584.1 on 3224 degrees of freedom
## Residual deviance: 2656.2 on 3195 degrees of freedom
## AIC: 2716
##
## Number of Fisher Scoring iterations: 8
## $`companyParamount Pictures`
## [1] 2.59
##
## $`companySony Pictures`
## [1] 2.01
##
## $`companyUniversal Pictures`
## [1] 2.27
##
## $`companyWalt Disney`
## [1] 2.45
##
## $`companyWarner Bros`
## [1] 1.92
## $seasonSpring
## [1] 1.31
##
## $seasonSummer
## [1] 1.61
##
## $seasonWinter
## [1] 1.33
Check the overall effect of company (df = 5), genres (df = 17) and season (df = 3) with Wald tests
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 68.1, df = 5, P(> X2) = 2.6e-13
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 39.2, df = 17, P(> X2) = 0.0017
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 13.6, df = 3, P(> X2) = 0.0036
We can validate the model with some methods:
Hosmer and Lemeshow test
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: movie_nd[-c(1)]$y, fitted(prf_glm)
## X-squared = 3225, df = 8, p-value <2e-16
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: movie_nd[-c(1)]$y, fitted(prf_glm0)
## X-squared = 3225, df = 8, p-value <2e-16
ROC curve and AUC:
## Area under the curve: 0.841
## Area under the curve: 0.841
McFadden
## llh llhNull G2 McFadden r2ML r2CU
## -1328.072 -1792.074 928.005 0.259 0.250 0.373
## llh llhNull G2 McFadden r2ML r2CU
## -1328.114 -1792.074 927.921 0.259 0.250 0.373
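The pseudo R-squared values above can be reproduced from the two log-likelihoods and the sample size. The sketch below uses the standard definitions of the McFadden, maximum-likelihood (Cox and Snell) and Cragg-Uhler (Nagelkerke) measures; the function name `pseudo_r2` is hypothetical, and the inputs are the first row of the output above with n = 3225.

```python
import math


def pseudo_r2(llh, llh_null, n_obs):
    """Return G2 and three pseudo R-squared measures for a fitted logit."""
    g2 = -2 * (llh_null - llh)            # likelihood-ratio statistic
    mcfadden = 1 - llh / llh_null
    r2_ml = 1 - math.exp(-g2 / n_obs)     # Cox and Snell
    r2_cu = r2_ml / (1 - math.exp(2 * llh_null / n_obs))  # Cragg-Uhler
    return {"G2": g2, "McFadden": mcfadden, "r2ML": r2_ml, "r2CU": r2_cu}


result = pseudo_r2(llh=-1328.072, llh_null=-1792.074, n_obs=3225)
for name, value in result.items():
    print(name, round(value, 3))
```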
Let’s test with revenue, assuming that the movie has been released in a particular region and that we use the revenue from that region to predict whether the movie will earn a profit.
Let’s see which is the best model with revenue included and budget excluded
## revenue popularity runtime score vote genres company season Criterion
## 1 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE 2070
## 2 TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE 2070
## 3 TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE 2071
## 4 TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE 2072
## 5 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 2073
In this model, we exclude budget and, following the model selection above, leave out company and season.
##
## Call:
## glm(formula = y ~ revenue + popularity + score + vote + genres +
## runtime, family = "binomial", data = movie_nd[-c(2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.949 0.000 0.050 0.497 2.283
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.67e+00 5.11e-01 -9.13 < 2e-16 ***
## revenue 5.43e-08 2.89e-09 18.78 < 2e-16 ***
## popularity -1.80e-02 8.73e-03 -2.06 0.03959 *
## score 9.74e-01 8.32e-02 11.72 < 2e-16 ***
## vote 3.75e-04 2.68e-04 1.40 0.16175
## genresAdventure -6.59e-01 2.65e-01 -2.48 0.01301 *
## genresAnimation -1.59e+00 4.98e-01 -3.19 0.00143 **
## genresComedy 6.29e-01 1.80e-01 3.49 0.00048 ***
## genresCrime 3.55e-01 2.78e-01 1.28 0.20098
## genresDocumentary 4.01e-01 4.69e-01 0.85 0.39302
## genresDrama 3.15e-01 1.77e-01 1.77 0.07626 .
## genresFamily -4.24e-01 5.66e-01 -0.75 0.45389
## genresFantasy 2.30e-01 4.18e-01 0.55 0.58251
## genresHistory 9.61e-01 7.26e-01 1.32 0.18550
## genresHorror 1.73e+00 2.86e-01 6.06 1.4e-09 ***
## genresMusic 3.26e-01 5.90e-01 0.55 0.58028
## genresMystery 2.57e-01 5.86e-01 0.44 0.66115
## genresRomance 5.81e-01 3.98e-01 1.46 0.14448
## genresScience Fiction 1.16e-01 4.36e-01 0.27 0.79065
## genresThriller 5.15e-01 3.15e-01 1.63 0.10243
## genresWar -1.38e+00 7.22e-01 -1.90 0.05685 .
## genresWestern 2.49e+00 8.07e-01 3.08 0.00204 **
## runtime -2.36e-02 3.28e-03 -7.19 6.4e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3584.1 on 3224 degrees of freedom
## Residual deviance: 2026.0 on 3202 degrees of freedom
## AIC: 2072
##
## Number of Fisher Scoring iterations: 8
This model has a better (lower) AIC than the model above: 2072 versus 2716. This is expected, since the correlation plot shows that profit is more strongly related to revenue.
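The AIC values being compared can be recovered from the residual deviances in the model summaries. For a binary logit fitted to individual observations, the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The sketch below assumes that setting; the function name `aic_from_deviance` is hypothetical, and the coefficient counts are obtained as 3225 minus the residual degrees of freedom.

```python
def aic_from_deviance(residual_deviance, n_parameters):
    """AIC for an ungrouped binary logit: deviance + 2 * parameters."""
    return residual_deviance + 2 * n_parameters


n_obs = 3225

# Model without runtime and without revenue: residual df = 3195
aic_budget_model = aic_from_deviance(2656.2, n_obs - 3195)

# Model with revenue, without budget, company and season: residual df = 3202
aic_revenue_model = aic_from_deviance(2026.0, n_obs - 3202)

print(round(aic_budget_model), round(aic_revenue_model))
```

Most of the gap between the two AIC values comes from the drop in deviance rather than from the smaller number of coefficients.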
## llh llhNull G2 McFadden r2ML r2CU
## -1013.011 -1792.074 1558.127 0.435 0.383 0.571
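The McFadden pseudo R-squared is 0.435, up from 0.259 for the model without revenue; loosely speaking, about 43.5% of the variation in y is explained.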
## Area under the curve: 0.913
## [1] 2716
## [1] 2899
## [1] "---"
## [1] 2072
## [1] 2212
Overall, I do not prefer the model using revenue. Revenue is strongly correlated with profit (so it may explain the profit status very well), but it only becomes known after the movie is released, so it is not useful for prediction. We should use variables known before the release of a movie, such as budget, genres, company, vote, runtime, score, …
Some may argue that revenue from one region could serve as a sample for the prediction; however, regional revenue is not representative of worldwide revenue.
The chi-squared tests also show that profit status depends on company and season, yet the model with revenue leaves these variables out. This does not seem to be a practical case.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 686 920
## 1 101 1518
##
## Accuracy : 0.683
## 95% CI : (0.667, 0.699)
## No Information Rate : 0.756
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.366
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.872
## Specificity : 0.623
## Pos Pred Value : 0.427
## Neg Pred Value : 0.938
## Prevalence : 0.244
## Detection Rate : 0.213
## Detection Prevalence : 0.498
## Balanced Accuracy : 0.747
##
## 'Positive' Class : 0
##
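Every statistic in the block above follows from the four cell counts. The sketch below recomputes them in base R; the variable names are ours, and the 'positive' class is 0 as stated in the output.

```r
# Cell counts copied from the confusion matrix above ('positive' class is 0)
tp <- 686   # predicted 0, reference 0
fp <- 920   # predicted 0, reference 1
fn <- 101   # predicted 1, reference 0
tn <- 1518  # predicted 1, reference 1
n  <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)                # positive predictive value
npv         <- tn / (tn + fn)                # negative predictive value
nir         <- max(tp + fn, fp + tn) / n     # no-information rate: share of the larger class
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals give by chance
p_chance    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              nir = nir, balanced = balanced, kappa = cohen_kappa), 3))
```

Note that the accuracy (0.683) is below the no-information rate (0.756), which is what the P-Value [Acc > NIR] of 1 reflects.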
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 3224 3584
## budget 1 40 3223 3545 3.2e-10 ***
## popularity 1 619 3222 2925 < 2e-16 ***
## runtime 1 0 3221 2925 0.99475
## score 1 28 3220 2897 1.2e-07 ***
## vote 1 109 3219 2788 < 2e-16 ***
## genres 17 45 3202 2743 0.00021 ***
## company 5 73 3197 2670 2.4e-14 ***
## season 3 14 3194 2656 0.00348 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
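Each p-value in the table is the upper tail of a chi-squared distribution evaluated at the deviance drop, with the term's Df as degrees of freedom. The sketch below does this for company and season. The deviance drops are copied from the table, where they are already rounded, so the results only approximate the printed p-values.

```r
# Deviance drops and degrees of freedom copied (rounded) from the table above
p_company <- pchisq(73, df = 5, lower.tail = FALSE)  # company: 73 on 5 df
p_season  <- pchisq(14, df = 3, lower.tail = FALSE)  # season: 14 on 3 df

print(signif(c(company = p_company, season = p_season), 2))
```

Because the terms are added sequentially, each test is conditional on the terms listed before it; a different order could give different deviance drops.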
Next, we check which of the two factors, genres or company, improves the model more.
##
## Call:
## glm(formula = y ~ budget + popularity + score + vote + genres,
## family = "binomial", data = movie_nd[-c(1)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.487 0.000 0.309 0.769 1.765
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65e+00 3.98e-01 -4.15 3.4e-05 ***
## budget -1.38e-08 1.90e-09 -7.29 3.0e-13 ***
## popularity 2.32e-02 9.71e-03 2.39 0.01702 *
## score 2.40e-01 6.49e-02 3.70 0.00021 ***
## vote 2.34e-03 3.31e-04 7.06 1.7e-12 ***
## genresAdventure 3.25e-02 2.06e-01 0.16 0.87496
## genresAnimation 3.89e-01 3.31e-01 1.18 0.23907
## genresComedy 5.09e-01 1.52e-01 3.36 0.00078 ***
## genresCrime 9.62e-03 2.45e-01 0.04 0.96867
## genresDocumentary 5.08e-01 4.32e-01 1.18 0.23956
## genresDrama 1.80e-02 1.50e-01 0.12 0.90431
## genresFamily 5.23e-01 4.49e-01 1.16 0.24408
## genresFantasy 4.55e-01 3.51e-01 1.30 0.19498
## genresHistory 5.63e-01 6.32e-01 0.89 0.37299
## genresHorror 1.01e+00 2.60e-01 3.87 0.00011 ***
## genresMusic 4.25e-01 5.34e-01 0.80 0.42581
## genresMystery -4.13e-02 5.12e-01 -0.08 0.93561
## genresRomance 5.27e-01 3.44e-01 1.53 0.12585
## genresScience Fiction -2.29e-02 3.71e-01 -0.06 0.95085
## genresThriller -1.59e-01 2.70e-01 -0.59 0.55721
## genresWar -1.09e+00 5.89e-01 -1.85 0.06418 .
## genresWestern 1.73e+00 8.12e-01 2.13 0.03331 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3584.1 on 3224 degrees of freedom
## Residual deviance: 2742.8 on 3203 degrees of freedom
## AIC: 2787
##
## Number of Fisher Scoring iterations: 8
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1049
##
##
## | season_knn
## test2$season | Fall | Spring | Summer | Winter | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Fall | 113 | 71 | 70 | 73 | 327 |
## | 0.346 | 0.217 | 0.214 | 0.223 | 0.312 |
## | 0.365 | 0.300 | 0.268 | 0.303 | |
## | 0.108 | 0.068 | 0.067 | 0.070 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Spring | 51 | 48 | 75 | 40 | 214 |
## | 0.238 | 0.224 | 0.350 | 0.187 | 0.204 |
## | 0.165 | 0.203 | 0.287 | 0.166 | |
## | 0.049 | 0.046 | 0.071 | 0.038 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Summer | 82 | 68 | 59 | 72 | 281 |
## | 0.292 | 0.242 | 0.210 | 0.256 | 0.268 |
## | 0.265 | 0.287 | 0.226 | 0.299 | |
## | 0.078 | 0.065 | 0.056 | 0.069 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Winter | 64 | 50 | 57 | 56 | 227 |
## | 0.282 | 0.220 | 0.251 | 0.247 | 0.216 |
## | 0.206 | 0.211 | 0.218 | 0.232 | |
## | 0.061 | 0.048 | 0.054 | 0.053 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 310 | 237 | 261 | 241 | 1049 |
## | 0.296 | 0.226 | 0.249 | 0.230 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
##
##
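The cross-table above can be summarized as a single accuracy figure. The sketch below re-enters the counts by hand and compares the KNN accuracy with a baseline that always predicts the most common actual season; the object names are ours.

```r
# Counts copied from the cross-table above
# (rows = actual season in test2, columns = KNN prediction)
seasons <- c("Fall", "Spring", "Summer", "Winter")
tab <- matrix(
  c(113, 71, 70, 73,
     51, 48, 75, 40,
     82, 68, 59, 72,
     64, 50, 57, 56),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = seasons, predicted = seasons)
)

accuracy <- sum(diag(tab)) / sum(tab)     # share of correctly predicted seasons
baseline <- max(rowSums(tab)) / sum(tab)  # always predict the most common actual season

print(round(c(accuracy = accuracy, baseline = baseline), 3))
```

With 276 of 1049 seasons predicted correctly, the classifier is close to a random guess among four classes and falls short of the majority-class baseline.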
k = 35
k = 27
k = 17
## 'data.frame': 2669 obs. of 3 variables:
## $ revenue: num 2.64e+08 3.37e+08 1.00e+07 3.66e+08 1.28e+08 ...
## $ year : num 1995 1995 1995 1995 1995 ...
## $ quarter: chr "Q3" "Q2" "Q4" "Q2" ...
## # A tibble: 87 x 3
## # Groups: year [22]
## year quarter revenue
## <dbl> <fct> <dbl>
## 1 1995 Q1 252877967
## 2 1995 Q2 2630639816
## 3 1995 Q3 1562025856
## 4 1995 Q4 1637810344
## 5 1996 Q1 399881330
## 6 1996 Q2 3002163064
## 7 1996 Q3 905547568
## 8 1996 Q4 2465830133
## 9 1997 Q1 722506511
## 10 1997 Q2 2052763938
## # … with 77 more rows
## Qtr1 Qtr2 Qtr3 Qtr4
## 2011 3.16e+09 7.61e+09 4.39e+09 5.30e+09
## 2012 3.88e+09 8.56e+09 4.42e+09 6.92e+09
## 2013 3.86e+09 7.01e+09 5.34e+09 6.94e+09
## 2014 4.90e+09 6.67e+09 4.18e+09 8.30e+09
## 2015 3.27e+09 9.24e+09 5.34e+09 4.62e+09
## 2016 4.22e+09 7.70e+09 2.53e+09
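To make the seasonal pattern in this table explicit, the sketch below rebuilds the 2011 to 2016 window as a quarterly `ts` object and averages revenue by quarter. The values are copied from the printout, so they carry only three significant digits; `rev_ts` is an illustrative name, not the object used in the analysis.

```r
# Quarterly revenue copied (rounded to 3 significant digits) from the table above;
# the window runs from 2011 Q1 to 2016 Q3
rev_ts <- ts(
  c(3.16e9, 7.61e9, 4.39e9, 5.30e9,
    3.88e9, 8.56e9, 4.42e9, 6.92e9,
    3.86e9, 7.01e9, 5.34e9, 6.94e9,
    4.90e9, 6.67e9, 4.18e9, 8.30e9,
    3.27e9, 9.24e9, 5.34e9, 4.62e9,
    4.22e9, 7.70e9, 2.53e9),
  start = c(2011, 1), frequency = 4
)

# Average revenue per position in the year (1 = Q1, ..., 4 = Q4)
quarter_mean <- tapply(as.numeric(rev_ts), as.integer(cycle(rev_ts)), mean)
print(signif(quarter_mean, 3))
```

In this window Q2 has the highest average revenue and Q1 the lowest, which is the kind of seasonality an additive seasonal component is meant to capture.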
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.55e+09 -6.35e+08 2.92e+08 0.00e+00 9.27e+08 9.67e+08
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.05e+09 -4.67e+08 -8.91e+07 -3.50e+06 4.29e+08 1.99e+09 4
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = movie.ts)
##
## Smoothing parameters:
## alpha: 0.0475
## beta : 0.154
## gamma: 0.313
##
## Coefficients:
## [,1]
## a 4.70e+09
## b 5.96e+07
## s1 -1.15e+09
## s2 1.73e+09
## s3 3.53e+08
## s4 1.03e+09
## [1] 4.17e+19
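The coefficients above are enough to write out point forecasts by hand. The sketch below applies the additive Holt-Winters forecast formula (level plus h times trend plus the seasonal term for that position in the cycle), which is, to our understanding, what `predict()` computes for an additive fit. The helper `hw_forecast` is hypothetical, and the inputs are the rounded printed values.

```r
# Coefficients copied (rounded) from the HoltWinters output above
a <- 4.70e9                              # level
b <- 5.96e7                              # trend per quarter
s <- c(-1.15e9, 1.73e9, 3.53e8, 1.03e9)  # seasonal terms s1..s4

# Additive Holt-Winters point forecast h quarters ahead:
# level + h * trend + seasonal term, cycling through s1..s4
hw_forecast <- function(h) a + h * b + s[(h - 1) %% length(s) + 1]

print(signif(hw_forecast(1:4), 3))
```

The small alpha (0.0475) means the level reacts slowly to new observations, so these forecasts are driven mostly by the long-run level and the seasonal terms.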